Automatic Hierarchical Classification of Structured Deep Web Databases

نویسندگان

  • Weifeng Su
  • Jiying Wang
  • Frederick H. Lochovsky
چکیده

We present a method that automatically classifies structured deep Web databases according to a pre-defined topic hierarchy. We assume that there are some manually classified databases, i.e., training databases, in every node of the topic hierarchy. Each training database is probed using queries constructed from the node titles of the topic hierarchy and the query result counts reported by the database are used to represent the content of the database. Hence, when adding a new database it can be probed by the same set of queries and classified to a node whose training databases are most similar to the new one. Specifically, a support vector machine classifier is trained on each internal node of the topic hierarchy with these training databases and the new database can be classified into the hierarchy top-down level by level. A feature extension method is proposed to create discriminant features. Experiments run on real structured Web databases collected from the Internet show that this classification method is quite accurate.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Survey of Automatic Deep Web Classification Techniques

To devise vision of the next generation of the web, deep web technologies have gained larger attention in a last few years. An eminent feature of next generation of web is the automation of tasks. A large part of Deep web comprises of online structured domain specific databases that are accessed using web query interfaces. The information contained in these databases is related to a particular ...

متن کامل

A Bootstrapping Approach to classification of Deep web Query Interfaces

Classification of Deep web sources is a very important process for the data extraction process while accessing the deep web content since it deals with domain specific data only. The existing methods cannot effectively classify these web databases. Hence, to solve this problem, we propose a new framework that uses the bootstrapping approach for automatic and accurate classification of the query...

متن کامل

Classification of Deep Web Databases Based on the Context of Web Pages

New techniques are discussed for enhancing the classification precision of deep Web databases, which include utilizing the content texts of the HTML pages containing the database entry forms as the context and a unification processing for the database attribute labels. An algorithm to find out the content texts in HTML pages is developed based on multiple statistic characteristics of the text b...

متن کامل

Hierarchical Classification for Multiple, Distributed Web Databases

The proliferation of online information resources increases the importance of effective and efficient distributed searching. Our research aims to provide an alternative hierarchical categorization and search capability based on a Bayesian network learning algorithm. Our proposed approach, which is grounded on automatic textual analysis of subject content of online web databases, attempts to add...

متن کامل

Exploration of Deep Web Repositories

With the proliferation of online repositories (e.g., databases or document corpora) hidden behind proprietary web interfaces, e.g., keyword-/form-based search and hierarchical/graph-based browsing interfaces, efficient ways of exploring contents in such hidden repositories are of increasing importance. There are two key challenges: one on the proper understanding of interfaces, and the other on...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006